arXiv:1503.02098v1 [stat.ME] 6 Mar 2015


Variations of Q-Q Plots: The Power of our Eyes!

Adam Loy, Lendie Follett, Heike Hofmann* 

September 8, 2014 


Abstract 

In statistical modeling we strive to specify models that resemble data collected in
studies or observed from processes. Consequently, distributional specification and
parameter estimation are central to parametric models. Graphical procedures, such as
the quantile-quantile (Q-Q) plot, are arguably the most widely used method of
distributional assessment, though critics find their interpretation to be overly subjective.
Formal goodness-of-fit tests are available and are quite powerful, but only indicate
whether there is a lack of fit, not why there is a lack of fit. In this paper we explore
the use of the lineup protocol to inject rigor into graphical distributional assessment
and compare its power to that of formal distributional tests. We find that lineups of
standard Q-Q plots are more powerful than lineups of de-trended Q-Q plots and that
lineup tests are more powerful than traditional tests of normality. While we focus on
diagnosing non-normality, our approach is general and can be directly extended to the
assessment of other distributions.

1 Introduction


In statistical modeling we strive to specify models that resemble data collected in studies or
observed from processes. Consequently, distributional specification and parameter
estimation are central to parametric models. The statistical modeling process is cyclical (Tukey,
1977), so after parameters are estimated and the model is checked, the process might
continue through another cycle with a refined model formulation. Model checking is central to
statistical modeling; in particular, any conclusions based on a model depend on correct
distributional specifications. For example, prediction intervals in the classical regression setting
depend directly on the assumption of normality, so they are quite sensitive to departures
from normality.

Graphical procedures, such as the quantile-quantile (Q-Q) plot (Wilk and Gnanadesikan, 
1968), are arguably the most widely used method of distributional assessment, though critics 
find their interpretation to be overly subjective. Formal goodness-of-fit tests are available 
and are quite powerful, but only indicate whether there is a lack of fit, not why there is lack 
of fit. For example, the Shapiro-Wilk test (Shapiro and Wilk, 1965) is a powerful test of 
normality, but does not indicate what feature of the distribution is non-normal, so a plot, 
such as a Q-Q plot, must be rendered after any rejection. 

In this paper we explore the use of the lineup protocol (Buja et al., 2009) to inject rigor
into graphical distributional assessment and compare its power to that of formal distributional
tests. We focus on diagnosing non-normality, so our discussion centers around the normal
Q-Q plot, but our approach is general and can be directly extended to the assessment
of other distributions.

We will first discuss tests for normality, both from a numerical and graphical viewpoint, 
and then formally introduce the lineup protocol in the setting of quantile-quantile plots used 
for this paper. 

1.1 Classical tests of normality 

Numerous tests have been proposed to test whether a random sample comes from a normal 
distribution. In this section we review commonly used tests of normality. 

A series of distributional tests focuses on the difference between the empirical and
theoretical distribution functions. More formally, let $F_n$ be the empirical distribution function
(ECDF) based on a sample of size $n$, and $F$ be the hypothesized/true distribution. The
absolute difference between the two distribution functions at each sample point,
$|F_n(x_i) - F(x_i)|$, is the main contributor to the test statistics of the Kolmogorov-Smirnov
(KS-test, Kolmogorov, 1933; Smirnov, 1948), the Lilliefors (LF-test, Lilliefors, 1967), the
Anderson-Darling (AD-test, Anderson and Darling, 1954), and the Cramer-von-Mises tests (CVM-test,
Cramer, 1928; von Mises, 1928), as shown in Table 1.

Table 1: Four prominent tests for normality based on the difference between empirical and
hypothesized distribution function. An overview of the performance and power of these tests
can be found in Stephens (1974).

Test                  Statistic
Kolmogorov-Smirnov    $D = \sup_{1 \le i \le n} |F_n(x_i) - F(x_i)|$
Lilliefors            $D = \sup_{1 \le i \le n} |F_n(x_i) - F(x_i)|$
Anderson-Darling      $A = n \int_{-\infty}^{\infty} |F_n(x) - F(x)|^2 / (F(x)(1 - F(x))) \, dF(x)$
Cramer-von-Mises      $C = n \int_{-\infty}^{\infty} |F_n(x) - F(x)|^2 \, dF(x)$

The KS test uses the maximal difference, regardless of where in the range of the sample it
occurs; i.e. a difference, $D$, observed in either tail of the distribution carries the same
weight and is interpreted in the same way as a difference, $D$, in the center of the
distribution. While the KS test allows for the adjustment of the parameters of the normal
distribution to the sample mean and variance, it is more appropriate to use the LF test for this
purpose. LF and KS share the same test statistic, but the sampling distribution of the LF test
statistic is adjusted for the two estimated parameters. AD and CVM are both based on the total
area between the hypothesized distribution function and the empirical distribution function.
Compared to the KS test, the CVM test downplays the effects in the tails of a (normal)
distribution, while the AD test upweights the tail effect using a weighting of
$1/(F(x)(1 - F(x)))$ across the range of the sample.
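
The three ECDF-based statistics in Table 1 can be computed directly from the ordered sample. The following is a minimal sketch, not the authors' implementation: it uses the usual computational formulas for $D$, $C$, and $A$ (for $D$, the ECDF jump is checked on both sides of each sample point), and the function name `edf_statistics` and its arguments are hypothetical. When `mu` and `sigma` are omitted, they are estimated from the sample, which corresponds to the Lilliefors setting; the statistics still have to be compared against the appropriate null distribution, which is not shown here.

```python
from math import log
from statistics import NormalDist, mean, stdev


def edf_statistics(sample, mu=None, sigma=None):
    """Return the KS/LF statistic D, the CVM statistic C and the AD statistic A."""
    x = sorted(sample)
    n = len(x)
    # Without explicit parameters, plug in the sample mean and standard deviation.
    mu = mean(x) if mu is None else mu
    sigma = stdev(x) if sigma is None else sigma
    dist = NormalDist(mu, sigma)
    u = [dist.cdf(v) for v in x]  # F evaluated at the order statistics

    # D: largest gap between ECDF and F, checked just after and just before each jump.
    d_plus = max((i + 1) / n - u[i] for i in range(n))
    d_minus = max(u[i] - i / n for i in range(n))
    d = max(d_plus, d_minus)

    # C: computational form of n * integral (F_n - F)^2 dF.
    c = 1 / (12 * n) + sum((u[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))

    # A: computational form of the integral weighted by 1 / (F (1 - F)).
    s = sum((2 * i + 1) * (log(u[i]) + log(1 - u[n - 1 - i])) for i in range(n))
    a = -n - s / n
    return {"D": d, "C": c, "A": a}


if __name__ == "__main__":
    grid = [NormalDist().inv_cdf((i - 0.5) / 10) for i in range(1, 11)]
    print(edf_statistics(grid, mu=0.0, sigma=1.0))
```

The weighting in `A` is what makes it more sensitive to the tails than `C`: the logarithms grow quickly as the fitted probabilities approach 0 or 1.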

The Shapiro-Wilk test (SW-test, Shapiro and Wilk, 1965) does not utilize deviations
from the theoretical distribution function; rather, it focuses on the linearity of a normal Q-Q
plot. Under normality, a set of observations, $x_1, \ldots, x_n$, can be expressed as
$x_i = \mu + \sigma z_i$, where $z_i$ is a quantile from the standard normal distribution. The
Shapiro-Wilk test compares (up to a constant of proportionality, $c$) two estimates for $\sigma$:
the best linear unbiased estimate obtained from a generalized least squares regression of the
sample order statistics on their expected values, denoted $\hat{\sigma}$, and the sample standard
deviation, $s$.

    $W = \left( c \hat{\sigma} / s \right)^2$

For a sample drawn from a normal distribution, $\hat{\sigma}^2$ and $s^2$ are, up to a constant,
estimating the same quantity, whereas the two estimators will generally not be estimating the same
quantity under non-normality. The SW test has been shown to be the most powerful in
assessing non-normality (Stephens, 1974; Razali and Wah, 2011).

In Section 4 we will return to these tests in order to assess the effectiveness of different 
variations of standard Q-Q plots. 


1.2 Q-Q Plots 

Standard quantile-quantile (Q-Q) plots (Wilk and Gnanadesikan, 1968) are an essential
tool for visually evaluating a specific distributional assumption. A Q-Q plot is constructed
from a sample, $x_1, \ldots, x_n$, by plotting the theoretical quantiles, $F^{-1}(F_n(x_i))$,
against the sample quantiles, $x_i$. If the empirical distribution, $F_n$, is consistent with the
theoretical distribution, $F$, the points in the Q-Q plot fall on the line of identity. For any
sample tested against a distribution within a location-scale family, such as a normal, log normal,
or exponential distribution, the sample quantiles still fall on a line when plotted against the
theoretical quantiles of any of the family's member distributions. Plotting the empirical
quantiles of a normally distributed sample $x \sim N(\mu, \sigma^2)$ against the quantiles of a
standard normal will result in a line, where the slope is an estimate of $\sigma$ and the
intercept estimates $\mu$. Visually the only change in the Q-Q plot is a change in the scale of the
y-axis. We can therefore employ Q-Q plots in the more general framework of testing the distribution
of a sample for normality, similar to standard normality tests such as the AD, LF, CVM,
and SW tests. We do have to make a decision with respect to the exact parameters of the
normal distribution we test against when we plot a line alongside the points in the Q-Q
plot for additional comparison purposes, i.e. the parameters $\mu$ and $\sigma$ have to be
estimated from the sample. In Q-Q plots, variability is based on a robust measure of spread given
as the ratio of the inter-quartile ranges (IQRs) of the empirical and theoretical distributions:
$(F_n^{-1}(0.75) - F_n^{-1}(0.25)) / (F^{-1}(0.75) - F^{-1}(0.25))$ (Becker et al., 1988).

Based on a visual inspection in a Q-Q plot, a sample is therefore considered to be
consistent with a normal distribution if the empirical and theoretical quantiles fall close to the
line representing the theoretical distribution. This decision is helped additionally by an
assessment of whether the points fall inside the envelope of 95% pointwise confidence intervals
(Davison and Hinkley, 1997, p. 150-154).

In assessing differences between points and lines, onlookers have a tendency to evaluate 
the shortest, i.e. orthogonal, distance, even when asked to evaluate differences based on 
vertical distance (Vander Plas and Hofmann, 2014; Robbins, 2005; Cleveland and McGill, 
1984). In so-called de-trended Q-Q plots (Thode, 2002, p. 25-26) the y axis is changed to 
show the difference between theoretical and sample quantiles. The line of the theoretical 
distribution therefore falls onto the x axis, and (vertical) differences between empirical and 
theoretical distribution coincide with the orthogonal distance. De-trending should therefore aid
the visual comparison of the empirical CDF with the theoretical CDF. This also follows the
general standard graphical recommendation to directly plot the aspect of the data we want 
to show rather than asking audiences to derive it (Wainer, 2000). Another point in favor of 
this design is that it makes better use of the available space. 
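
The construction of the standard and de-trended displays described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used for the figures: the plotting positions $(i - 0.5)/n$ stand in for $F_n(x_i)$ (which would otherwise give an infinite quantile at the largest observation), the de-trended value is taken as sample quantile minus reference line, and all function names are hypothetical.

```python
from statistics import NormalDist, quantiles


def qq_points(sample):
    """Theoretical standard normal quantiles paired with the ordered sample."""
    n = len(sample)
    std_normal = NormalDist()
    observed = sorted(sample)
    # Assumed plotting positions (i - 0.5) / n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


def robust_scale(sample):
    """Ratio of the sample IQR to the IQR of the standard normal distribution."""
    q1, _, q3 = quantiles(sample, n=4)
    z = NormalDist()
    return (q3 - q1) / (z.inv_cdf(0.75) - z.inv_cdf(0.25))


def detrend(theoretical, observed, slope, intercept=0.0):
    """Vertical distance of each point from the reference line."""
    return [y - (intercept + slope * t) for t, y in zip(theoretical, observed)]


if __name__ == "__main__":
    data = [-2.9, -1.1, -0.7, -0.4, -0.1, 0.0, 0.2, 0.5, 0.9, 1.3, 3.4]
    t, y = qq_points(data)
    s = robust_scale(data)
    for ti, ri in zip(t, detrend(t, y, slope=s)):
        print(f"{ti:6.2f} {ri:6.2f}")
```

Plotting `observed` against `theoretical` gives the standard display; plotting the output of `detrend` against `theoretical` gives the de-trended display, in which the reference line coincides with the x axis.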

In this paper we investigate the effectiveness and power of the modifications made to 
Q-Q plots. Examples of the three versions of Q-Q plots under consideration are displayed in 
Figure 1, and include (from left to right): a control Q-Q plot, a standard Q-Q plot with an 
added grey band representing a 95% pointwise confidence region (Davison and Hinkley, 1997) 
based on the estimated standard error of the order statistics for an independent sample from 
the theoretical distribution, and a de-trended Q-Q plot. Note that all Q-Q plots in Figure 1 
are constructed from the same data. 

In order to objectively evaluate the three designs and quantify their effectiveness we make 
use of lineup tests. 

1.3 Lineup Tests 

Lineup tests were introduced by Buja et al. (2009) to evaluate and quantify the
significance of graphical findings. The idea behind a lineup test is that of a police lineup: the
chart of the observed data is placed randomly among a set of so-called null charts, showing
data created consistently with the null hypothesis. In the setting of a lineup of normal Q-Q
plots, the null hypothesis is either that $F$ is standard normal or that $F$ is normal with
parameters based on the sample mean and variance. If the 'suspect', i.e. the plot of the observed
data, can be identified from the null charts, this counts as evidence against the null
hypothesis. Multiple identifications of the data by independent observers then lead to a rejection
of the null hypothesis. The lineup protocol also allows for an assessment of the power of a
lineup (Majumder et al., 2013), and by showing different renderings of the exact same data
in lineups we can evaluate the power of different designs (Hofmann et al., 2012).

Figure 1: Three versions of Q-Q plots: control, standard, and de-trended.

In considering the power of a lineup, we need to estimate the probability, $p_i$, that observer
$i$ identifies the data from the lineup. If the observer is just guessing, this probability is
$1/m$, where $m$ is the number of plots in the lineup. The power of a lineup is then given as the
probability of rejecting the null hypothesis. Let $Y$ be the number of identifications of the data
plot in $N$ independent evaluations, and let $Y \sim F_Y$. The power of the lineup is then the
probability that more than $y_\alpha$ out of $N$ observers choose the true plot, or more formally

    $\text{Power} = \text{Power}_N = 1 - F_Y(y_\alpha)$,    (1)

where $y_\alpha$ is the critical value for a given significance level $\alpha$, i.e.
$P(Y > y_\alpha) \le \alpha$ under the null hypothesis. $Y$ is composed of the sum of $N$
observers' (binary) decisions $Y_i \sim B_{1, p_i}$, where $p_i$ is the probability
that individual $i$ chooses the data plot. This probability depends both on the strength of
the signal in the data plot and an individual's visual ability. Assessing this ability requires
that each individual evaluates multiple lineups. If that is not possible, we must assume that
all participants share the same ability, $p$. Similar to classical inference, we can make use of
power to assess the sensitivity of tests. This allows us to make decisions about designs for
particular tasks by evaluating lineups displaying the same data in different types of displays
(Hofmann et al., 2012).
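
Equation (1) can be made concrete under the simplifying assumption stated above, namely that all $N$ observers share the same probability $p$, so that $Y$ is binomial. The sketch below is illustrative only: the lineup size `m = 20` and the function names are assumptions, and the null distribution is taken to be binomial with success probability $1/m$ (pure guessing).

```python
from math import comb


def binom_sf(y, n, p):
    """P(Y > y) for Y ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(y + 1, n + 1))


def critical_value(n_obs, m=20, alpha=0.05):
    """Smallest y with P(Y > y) <= alpha when every observer is guessing."""
    for y in range(n_obs + 1):
        if binom_sf(y, n_obs, 1 / m) <= alpha:
            return y
    return n_obs


def lineup_power(n_obs, p, m=20, alpha=0.05):
    """Power = 1 - F_Y(y_alpha) when each observer picks the data plot with probability p."""
    return binom_sf(critical_value(n_obs, m, alpha), n_obs, p)


if __name__ == "__main__":
    print(critical_value(27))                # critical value for 27 observers
    print(round(lineup_power(27, 0.30), 3))  # power if each observer succeeds 30% of the time
```

Because $Y$ is discrete, the attained significance level is generally below the nominal $\alpha$, so the resulting test is somewhat conservative.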

In the next section, we describe the simulation study used to compare the three Q-Q 
plot designs. An initial comparison of the three designs is also given. We use a generalized 
linear mixed model to compare the power of the three designs in section 3, and also explore 
the feedback of the independent observers to compare the rationale for plot selection (i.e., 
rejection of normality). Finally, we compare the power of a lineup test of normality to the 
classical normality tests in section 4, and outline areas for future research in section 5. 

2 Simulation Setup and Model 

To further develop the assessment of normality using lineups, we conducted a study
comparing the three different versions of the Q-Q plot.

To investigate the power of the three different Q-Q plot versions, we sampled data from
a t distribution with varying degrees of freedom and sample sizes, and included a Q-Q plot
of these data in a lineup of null charts drawn from standard normal samples of the same
size. For lineup tests it is of extreme importance to consider the generation of the null sets
and the construction of the plots in the lineup. Null data are created consistently with the null
hypothesis. Here, we have two different null hypotheses to consider:

• Situation I: $H_0: F = N(0, 1)$

Null samples are drawn from a standard normal distribution; the reference line is the
line of identity. Lines and envelopes are the same across all panels; in particular, all
panels have the same scale.

• Situation II: $H_0: F = N(0, S^2)$

$S$ is based on the interquartile range of the data; null samples are drawn from $N(0, S^2)$.
The reference line has a slope of $S$ (and an intercept of 0). All panels have the same
scale.

Examples for both hypotheses are shown in figure 2. Both lineups show the same dataset
(in panel #($3^2 - 3$)). On the left the data stand out (all 33 observers picked the data plot),
i.e. we reject the null hypothesis of a standard normal distribution. On the right, the data
do not stand out (only 3 out of 27 observers picked the data); thus, we do not reject the
hypothesis of a normal distribution with parameters $\mu = 0$ and $\sigma = 1.578$.

Note that the above list of hypotheses is not exhaustive. Any theoretical distribution in 
Q-Q plots corresponds to a hypothesis test against that distribution. As long as there is a 
method to generate samples under the null hypothesis, we do not even need to know the 
exact distribution. This allows us to assess situations in which we only have approximate or 
asymptotic results, which are hard, if not impossible, to investigate with the (small) finite 
samples we typically deal with in practice. 

Note that the IQR is used here in estimating scale, which is standard practice for Q-Q plots.
Robust estimation of the variance is preferred for better assessment of the tails and outliers of
the empirical distribution. We could use alternative estimators for variance, such as median
absolute deviation (MAD) or adjusted MAD (Rousseeuw and Croux, 1993), but this will
likely also change the power of the corresponding lineup.
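
The two null-generation schemes can be sketched as follows. This is a minimal illustration, not the study's code: the lineup size `m = 20`, the quartile convention used by `statistics.quantiles`, the seeds, and all function names are assumptions made for the example. The t sample is drawn as a standard normal divided by the square root of a scaled chi-square variable.

```python
import random
from statistics import NormalDist, quantiles


def t_sample(n, df, rng):
    """Draw n values from a t distribution with df degrees of freedom."""
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
        out.append(z / (chi2 / df) ** 0.5)
    return out


def iqr_scale(sample):
    """S: sample IQR divided by the IQR of the standard normal distribution."""
    q1, _, q3 = quantiles(sample, n=4)
    z = NormalDist()
    return (q3 - q1) / (z.inv_cdf(0.75) - z.inv_cdf(0.25))


def build_lineup(data, situation, m=20, seed=None):
    """Place the data among m - 1 null samples; returns panels, data position, sigma."""
    rng = random.Random(seed)
    # Situation I: nulls from N(0, 1). Situation II: nulls from N(0, S^2).
    sigma = 1.0 if situation == "I" else iqr_scale(data)
    panels = [[rng.gauss(0.0, sigma) for _ in data] for _ in range(m - 1)]
    position = rng.randrange(m)
    panels.insert(position, list(data))
    return panels, position + 1, sigma


if __name__ == "__main__":
    data = t_sample(30, df=5, rng=random.Random(1))
    panels, position, sigma = build_lineup(data, situation="II", seed=2)
    print(len(panels), position, round(sigma, 3))
```

Swapping `iqr_scale` for another robust estimator such as the MAD changes only the value of `sigma`, but, as noted above, this would likely change the power of the resulting lineup.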





Figure 2: Lineup plots of standard Q-Q plots. The observed data are the same, but reference
lines and envelopes are based on a standard normal distribution on the left, while reference
lines and envelopes for the lineup on the right are based on a normal distribution $N(0, S^2)$,
where $S$ is based on the IQR of the observed data. The observed data in both lineups are
displayed in panel #($3^2 - 3$).


Next, we model the aforementioned probability $p_i$ with which observer $i$ picks the true
data from a lineup. Let $X_i \sim B_{1, \pi_i}$, $1 \le i \le n$, where $X_i$ is the binary
decision on the $i$th evaluation and $\pi_i$ is the probability with which the observer chooses
the data plot. This probability is influenced by a number of factors:

$\tau$    the design used in the lineup (Control, Standard, De-trended),

          the specific parameters under which the data for the lineup were created:

$\delta$  degrees of freedom (2, 5, 10) of the t distribution and
$\nu$     sample size (20, 30, 50, 75),
$d$       the level of difficulty based on the actual sample, and
$u$       the users' subjective abilities.

The combination of the different levels of sample size and degrees of freedom of the t
distribution results in 12 parameter settings. Under each setting, we generated data twice.
Additionally, we made use of two different sets of null data for each sample, yielding 48
different sets of data, which we rendered in each of the three variations, resulting in 144
different lineups.

Using Amazon MTurk (Amazon, 2010), 674 independent observers were recruited and 
asked to evaluate ten lineups each. Half of the lineups that observers were shown allowed 
multiple choices of plots from a lineup for the final answer. While most participants still 
chose only a single plot, in the analysis we dealt with multiple answers to a lineup by using a 
weighting variable defined as the reciprocal of the number of answers given by a participant. 
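
The weighting scheme for multiple answers can be illustrated with a short sketch. The function name and the representation of a response as a set of panel numbers are hypothetical; only the rule itself, a weight equal to the reciprocal of the number of panels chosen, is taken from the text.

```python
def weighted_identifications(responses, data_panel):
    """Sum of 1 / (number of picks) over all responses that include the data panel."""
    total = 0.0
    for picks in responses:
        if picks and data_panel in picks:
            total += 1.0 / len(picks)
    return total


if __name__ == "__main__":
    # Four hypothetical responses to a lineup whose data plot is panel 5.
    responses = [{5}, {5, 7}, {7}, {3, 5}]
    print(weighted_identifications(responses, data_panel=5))
```

A participant who selects two panels, one of which is the data plot, therefore contributes half an identification, which is how non-integer counts can arise.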


Figure 3 shows proportions of correct evaluations of the lineups under the three different
variations of Q-Q plots. All three versions provide highly correlated results, and largely agree
for extreme decisions (all correct/all wrong evaluations). In the middle range de-trended
Q-Q plots perform worse than either standard or control Q-Q plots. Lineups of data samples
from a $t_2$ distribution are all rejected in lineups under the standard design, while most of the
$t_{10}$ samples go undetected. Lineups showing a sample from a $t_5$ distribution cover the whole
range. We base an evaluation of the three different Q-Q plot designs on the premise that
if participants find it easier to identify the data plot under one lineup over another lineup
(given identical data underlying the lineups), the first lineup uses the better design.

Figure 3: Proportions of successful evaluation of the same data in the three different
variations of Q-Q plots. Standard and control displays exhibit the highest correlation. De-trended
Q-Q plots agree with decisions made based on Q-Q plots in the control or standard design,
but display lower rates of correct responses in the 'middle' field. Significances are based on
lineup evaluations in the standard design.


Let $Y_i$ be the outcome of the $i$th evaluation. Then $Y_i$ is a Bernoulli variable, where
$\pi_i$ denotes the probability of identifying the data plot from the lineup; i.e.
$P(Y_i = 1) = \pi_i = E[Y_i]$. The probability of identifying the data plot is affected by several
factors: (a) the strength of the signal, i.e. the degrees of freedom of the t distribution and the
sample size, (b) a human factor, i.e. the visual ability of the observer, and (c) the 'lineup
factor': depending on which $m - 1$ representatives of the test distribution the null plots show,
lineups of the same data plot can differ in difficulty. We capture all of this in a logistic
regression model with fixed effects for design and signal strength and random effects for lineup
difficulty, $d$, and user ability, $u$:

    $\eta_i = \mu + \tau_{j(i)} + \delta_{k(i)} + \nu_{s(i)} + u_{u(i)} + d_{d(i)}$
    $Y = g^{-1}(\eta) + \epsilon$

where $g$ is the logit link function, and $j(\cdot)$, $k(\cdot)$, $s(\cdot)$, $u(\cdot)$, and
$d(\cdot)$ are indexing functions that relate evaluation $i$ to the corresponding levels in the
factor variables, to the observer, or to a particular data sample. More specifically,
$j(i) \in \{\text{Control}, \text{Standard}, \text{De-trended}\}$; $k(i) \in \{2, 5, 10\}$;
$s(i) \in \{20, 30, 50, 75\}$; $u(i)$ maps to the participant's id of the $i$th evaluation;
and $d(i)$ identifies the particular data sample used. Both user ability, $u$, and sample
difficulty, $d$, are modeled as independent, normally distributed random effects, i.e.
$u_{u(i)} \sim N(0, \sigma_u^2)$, $d_{d(i)} \sim N(0, \sigma_d^2)$ with $\mathrm{cov}(u, d) = 0$.
We further assume that $E[\epsilon] = 0$ and $\mathrm{Var}[\epsilon] = \sigma^2$.

Table 2: Coefficients and significances corresponding to model $M_1$. The type of design is
important for the power of a lineup. De-trended Q-Q plots lose a significant amount of power
compared to both the control and the standard version of Q-Q plots.



                        Estimate   Std. Error   z value   Pr(>|z|)
Intercept                 -5.37      0.769       -6.98     0.0000   ***
design
  Control                  0.00        -           -         -
  Standard                 0.06      0.103        0.62     0.5371
  De-trended              -0.50      0.104       -4.77     0.0000   ***
degrees of freedom
  2                        6.63      0.752        8.82     0.0000   ***
  5                        2.65      0.732        3.61     0.0003   **
  10                       0.00        -           -         -
sample size
  20                       0.00        -           -         -
  30                       0.88      0.848        1.03     0.3014
  50                       3.26      0.837        3.90     0.0001   ***
  75                       2.20      0.838        2.63     0.0086   **

Signif. codes: 0 < *** < 0.001 < ** < 0.01 < * < 0.05 < . < 0.1 < ' ' < 1


The estimated model coefficients for model $M_1$ are shown in Table 2. Estimates of the
variance components are $\hat{\sigma}_u = 0.44$, $\hat{\sigma}_d = 1.95$, and
$\hat{\sigma} = 0.31$. Variances of user ability and data difficulty are large relative to
residual variance, indicating that both random effects are necessary. Compared to the difficulty
level of lineups, participants' abilities vary only a little. The difference between best and
worst performance by participants has an effect of at most an estimated 1.9-fold probability of
detecting the data plot from a lineup.
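
To read the coefficients in Table 2 on the probability scale, the linear predictor can be passed through the inverse logit. The sketch below is illustrative only: the coefficient values are copied from Table 2, while the function name and the default of setting both random effects to zero (an observer of average ability and a lineup of average difficulty) are assumptions made for the example.

```python
from math import exp

# Fixed-effect estimates copied from Table 2 (reference levels are 0).
COEF = {
    "intercept": -5.37,
    "design": {"Control": 0.00, "Standard": 0.06, "De-trended": -0.50},
    "df": {2: 6.63, 5: 2.65, 10: 0.00},
    "n": {20: 0.00, 30: 0.88, 50: 3.26, 75: 2.20},
}


def detection_probability(design, df, n, u=0.0, d=0.0):
    """Inverse logit of the linear predictor; u and d are the random effects."""
    eta = COEF["intercept"] + COEF["design"][design] + COEF["df"][df] + COEF["n"][n] + u + d
    return 1.0 / (1.0 + exp(-eta))


if __name__ == "__main__":
    for design in ("Control", "Standard", "De-trended"):
        print(design, round(detection_probability(design, df=5, n=50), 3))
```

For a $t_5$ sample of size 50, the linear predictor in the standard design is $-5.37 + 0.06 + 2.65 + 3.26 = 0.60$, a probability of about 0.65, while the de-trended design gives $0.04$, a probability of about 0.51.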


3 Power: three different designs of Q-Q plots 

As expected, the task of identifying non-normality becomes easier with increased sample size
and more pronounced deviations from normality due to lower degrees of freedom. The design
of the Q-Q plot is of huge importance for the probability of choosing the data plot: compared
to the control chart, add-on confidence bands help with evaluation in the standard design,
but the difference is not significant. Surprisingly, the de-trended Q-Q plot is significantly less
powerful in detecting non-normality than either of the other designs. In terms of rejections
of the null hypothesis, this means that normality is rejected in 24 out of the 48 lineups of
the de-trended Q-Q plot. All of these cases are rejected in all of the other designs as
well, but using the control design another four lineups reject normality, and the standard
design rejects yet another lineup.

To further investigate the difference between the standard and the de-trended designs,
consider figure 4. Here, we have an example of a sample that is rejected based on a lineup
in the standard design, but not based on a lineup of de-trended Q-Q plots. Instead of focusing
on the panel showing the sample, observers focus on panel #($3^2 - 2$) (with 18 out of 21
picks). This panel was picked as being most different 9 out of 27 times in the standard
design, too, indicating that there is something special about it, but most observers (16 out
of 27) picked the data in panel #($2^2 + 1$) from the standard design. Two of the observers
picking panel #7 from the de-trended lineup thought that this panel was the one with the
“most dots outside the shaded area,” i.e. they focused on the middle of plot #7. This is
not a singular occurrence. When investigating overall reasoning we take a closer look at the
effect of the words “area” and “outside” in the reasons participants gave for making their
choice of plot (figure 5). As expected, these words barely occur in the control design (where
there is no shaded area). Mentioning them seems to help in identifying the data plot in the
standard design, but it severely increases the chance of picking a null plot in the de-trended
design. Why is that? The de-trended version makes better use of the space in the plot; it
therefore emphasizes deviations of points from the x-axis (i.e. the theoretical distribution) and,
with it, whether individual points are inside or outside the shaded areas corresponding
to the (pointwise) 95% confidence intervals. The responses suggest that participants take
the shading very seriously and make their choice dependent on it. It also seems that the
confidence bands mislead people. This suggests that for the de-trended Q-Q plot design we
might have to re-think how to display confidence intervals: it might be better to either use a
more conservative confidence level or change the approach altogether from pointwise
confidence intervals to simultaneous confidence bands as, for example, discussed by Rosenkrantz
(2000).

Figure 4: Lineups of the same data in two different designs. In the standard design the
data sample (in panel #($2^2 + 1$)) is identified 16 out of 27 times, leading to a rejection of
normality. The same data set is only identified once out of 21 times in the lineup showing
the de-trended design. Observers instead pick panel #($3^2 - 2$) 18 times.

Another promising approach might be to base confidence bands on the TS test
(Aldor-Noiman et al., 2013). These bands define a test of normality and are narrower in the tails
than those associated with the Kolmogorov-Smirnov test, while slightly more conservative
in the middle of the distribution. These findings coincide with our observation that points
outside the confidence intervals were misleading participants, if this occurred in the middle
of the plot.
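
The concern about pointwise intervals can be illustrated with a small simulation. This is only a sketch under stated assumptions: the envelope here is obtained by simulating standard normal samples and taking quantiles of each order statistic, rather than from the estimated standard errors of the order statistics used for the bands in this paper, and the function names, sample size, number of replicates, and seed are hypothetical choices.

```python
import random


def sorted_normal_sample(n, rng):
    """Ordered sample of size n from the standard normal distribution."""
    return sorted(rng.gauss(0.0, 1.0) for _ in range(n))


def pointwise_envelope(n, n_sim, rng, level=0.95):
    """Simulated pointwise band: per order statistic, the central `level` range."""
    sims = [sorted_normal_sample(n, rng) for _ in range(n_sim)]
    lo_idx = int((1 - level) / 2 * n_sim)
    hi_idx = int((1 + level) / 2 * n_sim) - 1
    lower, upper = [], []
    for k in range(n):
        column = sorted(s[k] for s in sims)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper


def share_with_point_outside(n, n_sim, n_test, seed=1):
    """Fraction of truly normal samples with at least one point outside the band."""
    rng = random.Random(seed)
    lower, upper = pointwise_envelope(n, n_sim, rng)
    flagged = 0
    for _ in range(n_test):
        x = sorted_normal_sample(n, rng)
        if any(v < lo or v > hi for v, lo, hi in zip(x, lower, upper)):
            flagged += 1
    return flagged / n_test


if __name__ == "__main__":
    print(share_with_point_outside(n=30, n_sim=1000, n_test=1000))
```

Each interval covers its own order statistic about 95% of the time, but the chance that some point of a normal sample falls outside the band is considerably larger than 5%. Simultaneous bands are designed to control exactly this joint event.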




Figure 5: Mentioning “outside” or “area” in the reason for selecting the plot from the lineup 
increases the probability of not identifying the data plot by a large factor in de-trended Q-Q 
plots. 


Figure 6 summarizes the reasons participants gave for the choice of plot they selected.
Bars on the left show identifications of the data plot, bars on the right represent selections
of a null plot. The reasons shown are the four reasons offered to participants as check boxes.
Notably, the reasons do not seem to differentiate between the three designs. Stating
“outliers” as a reason for choosing a plot is helpful across all designs in picking the data out
of the lineup. Stating “left side different” or “points curve” as the reason for choosing a plot
decreases the chance for this plot to be the data plot. “Right side different” does not seem
to have any effect. Interestingly, there is a big difference in the percentages of “right” and
“left.” Participants preferred to give “left side different” as a reason over “right side
different”, even though all distributions involved were symmetric, and therefore deviations from
normality should also manifest themselves in a symmetric fashion.

Figure 6: Overview of reasons participants gave for their answer. “Outliers” as a reason
drastically improves the chance of identifying the data plot. All the other reasons either
have no effect or decrease the chance of picking the data. It is curious to see that so many
more observers respond “left side different” over “right”; all the samples come from a
t distribution, so deviations in the extremes should also be symmetric.

4 Power: visual and classical 

It is important to recall that none of the data plots in the lineups were actually created
using data from a normal distribution. Ideally, this should lead to rejection of the null
hypothesis in every single instance. This is not quite the case, as can be seen in Table 3, but
what becomes evident is the high power of visual inference. Based on lineups we are able to
reject normality much more often than with any of the classical tests.

Table 3: From left to right, we see the number of rejections from visual inference as well
as the Shapiro-Wilk, Anderson-Darling, Lilliefors, Kolmogorov-Smirnov, and Cramer-von
Mises tests for normality. Out of the 24 non-normal samples, 12 get rejected at the 5%
significance level based on evaluation by observers. None of the standard normality tests come
close to that rejection rate. The power we observe here matches the power discussion by
Razali and Wah (2011) for the SW, AD, and the LF test. The number in parentheses is the
number of situations in which the Standard Q-Q plot agrees with the normality test in rejecting
normality of the sample.


                      Standard   SW       AD       LF       CVM
reject $N(0, S^2)$       12      8 (7)    5 (5)    5 (5)    4 (4)


Figure 7 shows a scatterplot of p-values from the SW test and estimated p-values from
the lineup of Q-Q plots in the standard design. Out of the 24 samples, the tests agree on
18, of which seven are rejections. Of the remaining six, five are rejected only by the visual
test, and one is rejected by SW, but not by the visual test. Two samples on which the tests
disagree are circled in figure 7. The two lineups corresponding to these observations are
shown in figure 8. The lineup on the left corresponds to a sample that is rejected by the SW
test, but is not rejected by the visual test: only 1.5 decisions (at least one observer picked
two panels in their response) out of 38 identified the data panel as the most different,
which is not enough to reject the null hypothesis of $N(0, S^2)$. In contrast, the data plot in
the lineup on the right is picked by 23 out of 26 independent observers, leading to a very
clear rejection. The corresponding p-value in the SW test is 0.2318, which is the second-lowest
p-value on this data sample among all the normality tests, after LF (0.1689).

The difference in significance between normality tests and the visual test might be due 
to the way the theoretical distribution against which the sample is compared is chosen. The 
normality tests are based on the sample mean and sample variance. Both of these estimates 
are affected by outliers. Compared to a normal distribution, the samples from a t distribution 
exhibit heavier tails. In a finite sample, the heavier tails might look like outliers. By taking 
these outliers into account, the normality tests lose substantial power. The Q-Q plots, on 
the other hand, are based on a robust estimate of the scale based on the middle half of the 
empirical distribution. Q-Q plots are, therefore, less affected by outliers and the tails of a 
t distribution are more easily distinguishable from the tails of a normal distribution, as can 
be seen in the lineup on the right of figure 8. Compare this to the lineup of figure 9, which 
is based on the same data, but the nulls are sampled from a normal distribution with a
variance estimated as the sample variance. The data plot does not stand out, so we would
not reject the null hypothesis based on this lineup. An inferior performance of normality
tests based on sample mean and variance is also observed by Aldor-Noiman et al. (2013) in
the discussion of the TS test.

Figure 7: Scatterplot of p-values from the Shapiro-Wilk test and estimated p-values from
lineups of standard Q-Q plots. The grey shaded areas represent areas of rejection under at
least one of the tests. The circled observations correspond to samples that lead to decidedly
different decisions under the two tests. The lineups corresponding to these observations are
shown in figure 8.
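
The contrast between the two scale estimates discussed above can be illustrated with a small sketch. It assumes that the robust estimate is the interquartile range of the sample divided by the interquartile range of a standard normal distribution, which is one common way to base a Q-Q line on the middle half of the data; the exact estimator used for the plots in this study may differ. The function names and the artificial heavy-tailed sample are hypothetical.

```python
import statistics
from statistics import NormalDist

def classical_scale(sample):
    """Sample standard deviation, which reacts strongly to extreme values."""
    return statistics.stdev(sample)

def robust_scale(sample):
    """Scale from the middle half: sample IQR divided by the IQR of N(0, 1)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    z1, z3 = NormalDist().inv_cdf(0.25), NormalDist().inv_cdf(0.75)
    return (q3 - q1) / (z3 - z1)

# Deterministic illustration: 50 standard normal quantiles, and a copy in
# which the two most extreme points are replaced by far more extreme ones.
n = 50
clean = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
heavy = clean[1:-1] + [-8.0, 8.0]

for name, sample in [("clean", clean), ("heavy-tailed", heavy)]:
    print(name, round(classical_scale(sample), 3), round(robust_scale(sample), 3))
```

In this constructed example the robust scale is unchanged by the two extreme points, while the sample standard deviation is inflated. Null plots generated with the inflated scale would show a wider spread, which would make the tails of the data plot look less unusual.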

5 Discussion and Conclusion 

In the comparison of the three designs of Q-Q plots, de-trended Q-Q plots turn out to 
have significantly less power in detecting non-normality than Q-Q plots in the standard and 
the control design. This is surprising, as results from cognitive psychology suggest that 
the de-trended version has superior qualities. From the additional reasoning provided by 
participants regarding their choice of plot it becomes obvious that this choice is mainly 
driven by points outside the shaded area depicting 95% confidence intervals. This happens 
primarily in the middle of the distribution, confirming results by Aldor-Noiman et al. (2013), 
and reopens the question of whether the design or the choice of confidence calculation is 
the reason for the inferiority. It would also make sense to fix the aspect ratio of plots 
in the de-trended design to make comparisons between the range of points in both axis 
directions possible. All versions of Q-Q plots under consideration here are significantly
better at detecting deviations from normality than classical normality tests. A contributing
factor to this superior power might be that in Q-Q plots the whole sample is assessed rather
than being reduced to the single value considered for the test statistic. Contributing to the
power is also the robust estimation of the parameters for the normal distribution drawn as a
line of fit in Q-Q plots, while most normality tests are based on the outlier-sensitive sample
variance. This is consistent with findings in Aldor-Noiman et al. (2013) and also poses the
question of whether the power of classical normality tests might be improved by using robust
estimates for the mean and variance of the sample.

Figure 8: On the left, results are not significant; on the right, they are highly significant.
These two lineups correspond to the results circled in figure 7. The tables below the lineups
show the number of times each of the panels was picked as the most different. Non-integer
numbers result from responses in which an observer picked more than one panel. The
italicized numbers refer to the panel that contains the actual sample.

Figure 9: Lineup of standard design Q-Q plots showing the data of the lineup on the right
in figure 8. The hypothesized distribution is $\mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is estimated as the regular
(i.e. non-robust) sample variance, and nulls are drawn from that distribution. Although this
lineup was not tested with users, we do not think that the data plot stands out from the lineup.

Q-Q plots are not restricted to the assessment of normality. In fact, they provide a general
framework for testing any distributional assumption. Used in the setting of lineups, they
also allow an assessment of limiting distributions. For example, lineups allow
us to investigate samples from approximately or asymptotically normal
distributions, for which no classical test exists for finite sample sizes. The lineup
protocol provides a valid testing system as long as there is a method to generate
data under the null hypothesis for creating the null plots in the lineup.
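
The requirement stated above, a method to generate data under the null hypothesis, can be sketched as follows. This is a minimal illustration and not the interface of nullabor or any other package: the names `make_lineup` and `normal_null_sampler`, the choice of 20 panels, and the use of a quartile-based scale for the normal null are all assumptions. Any other null distribution could be substituted by passing a different sampler.

```python
import random
import statistics
from statistics import NormalDist

def make_lineup(data, null_sampler, panels=20, rng=None):
    """Return (list of panels, 0-based index of the panel holding the data)."""
    rng = rng or random.Random()
    position = rng.randrange(panels)  # random placement of the data panel
    lineup = [null_sampler(len(data), rng) for _ in range(panels - 1)]
    lineup.insert(position, list(data))
    return lineup, position

def normal_null_sampler(sigma):
    """Build a sampler that draws n values from N(0, sigma^2)."""
    def sampler(n, rng):
        return [rng.gauss(0.0, sigma) for _ in range(n)]
    return sampler

rng = random.Random(2014)
data = [rng.gauss(0.0, 1.0) for _ in range(30)]  # stand-in for an observed sample

# Scale of the null distribution from the middle half of the data (assumed choice).
q1, _, q3 = statistics.quantiles(data, n=4)
sigma = (q3 - q1) / (NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25))

lineup, position = make_lineup(data, normal_null_sampler(sigma), rng=rng)
print(len(lineup), position)
```

Each element of `lineup` would then be drawn as one Q-Q plot, and `position` is kept hidden from the observers until their choices have been recorded.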

One of the drawbacks of the lineup test framework is that it comes at a higher cost,
both monetary and in time, than classical testing. However, developments such as nullabor
(Wickham, 2012) or vis.test (Snow, 2013) allow us to be, at least to a degree, our own
testers. For tests that are of a more sensitive nature, the cost of a test using a
crowdsourcing service is certainly a small enough item in the overall project budget that it is a
feasible option. It also discourages the analyst from multiple testing!

Several possibilities for immediate extensions are obvious: the simulation study here is 
only concerned with deviations from normality as given by the t distribution. Other types
of deviations, such as skewed distributions or mixture distributions, would be interesting 
to consider as well. We doubt that the overall results would change dramatically, but it 
might provide more insight into what observers consider in making their assessments. The 
application of the lineup framework based on the related probability-probability (P-P) plots poses a
natural next question: Gan et al. (1991) comment on the higher sensitivity of P-P plots to
discrepancies in the middle of the distribution, such as those caused by multiple modes. We can
verify this and other statements for large samples and on the basis of theoretical
distributions. Lineups allow
us to quantify the extent to which these statements hold for small sample sizes.
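
For reference, the difference between the two plot types can be written down in a few lines. The sketch below assumes a fully specified normal reference distribution and the plotting positions (i - 0.5)/n; both are illustrative choices, and the function names are hypothetical.

```python
from statistics import NormalDist

def qq_points(sample, dist=NormalDist()):
    """Pairs (theoretical quantile, ordered observation) for a Q-Q plot."""
    n = len(sample)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(sorted(sample))]

def pp_points(sample, dist=NormalDist()):
    """Pairs (plotting position, theoretical CDF at the ordered observation) for a P-P plot."""
    n = len(sample)
    return [((i + 0.5) / n, dist.cdf(x)) for i, x in enumerate(sorted(sample))]

sample = [-1.2, 0.3, 0.0, 2.5, -0.4]
qq = qq_points(sample)
pp = pp_points(sample)
print(qq)
print(pp)
```

Because the P-P plot maps both axes to the unit interval, differences in the tails are compressed while differences near the center are spread out, which is in line with the sensitivity to the middle of the distribution noted by Gan et al. (1991).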